Expert-level aspiration and penetration detection during flexible endoscopic evaluation of swallowing with artificial intelligence-assisted diagnosis

Flexible endoscopic evaluation of swallowing (FEES) is considered the gold standard for diagnosing oropharyngeal dysphagia. Recent advances in deep learning have led to a resurgence of artificial intelligence-assisted computer-aided diagnosis (AI-assisted CAD) for a variety of applications. AI-assisted CAD would be of remarkable benefit in providing medical services to populations with inadequate access to dysphagia experts, especially in aging societies. This paper presents an AI-assisted CAD system named FEES-CAD for aspiration and penetration detection in video recordings obtained during FEES. FEES-CAD segments the input FEES video and classifies penetration, aspiration, residue in the vallecula, and residue in the hypopharynx based on the segmented FEES video. We collected and annotated FEES videos from 199 patients to train the network and tested the performance of FEES-CAD on FEES videos from another 40 patients. These patients consecutively underwent FEES between December 2016 and August 2019 at Fukushima Medical University Hospital. FEES videos were deidentified, randomized, and rated by FEES-CAD and by laryngologists with over 15 years of experience in performing FEES. FEES-CAD achieved an average Dice similarity coefficient of 98.6%. FEES-CAD achieved expert-level accuracy on the penetration (92.5%), aspiration (92.5%), residue in the vallecula (100%), and residue in the hypopharynx (87.5%) classification tasks. To the best of our knowledge, FEES-CAD is the first CNN-based system that achieves expert-level performance in detecting aspiration and penetration.


Experiments
A series of breakthroughs in image classification using CNNs has laid the foundation for semantic segmentation networks, and the continuous optimization of these networks has gradually improved the accuracy of medical image segmentation. For baseline comparisons, we first ran experiments on both convolutional and transformer-based methods. For convolutional baselines, we used HRNet [1], DeepLabv3+ [2], VNet [3], UNet++ [4], ResUNet-a [5], and U2-Net [6]. For transformer-based baselines, we used three state-of-the-art methods, namely Att UNet [7], TransUNet [8], and Swin-UNet [9]. All networks were developed using Python 3.9 and the TensorFlow 2.6.0 framework and were trained on an Intel Core i7-10700F CPU with 32.0 GB RAM and a GeForce RTX 3090 GPU.

Datasets
All networks were trained on 25,630 expert-annotated images from 199 FEES videos and tested on 40 FEES videos. Videos were deidentified, randomized, and rated by an expert panel of laryngologists and dysphagia experts with over 15 years of experience performing FEES. The expert panel was blinded to all identifying information about the examinees and examined the videos in real time combined with frame-by-frame analysis.

Training Methodology
The image size was set to 256 × 256, and training images were augmented through zooming and random color jitter (brightness, contrast, saturation, and hue). All networks were trained from scratch with weights initialized by the He-normal initializer [10].
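As an illustration, the following TensorFlow sketch shows one way the zoom and color-jitter augmentation described above could be implemented; the zoom range and jitter magnitudes are assumptions for illustration rather than the exact values used for FEES-CAD.

import tensorflow as tf

IMG_SIZE = 256  # frames are resized to 256 x 256 as stated above

def augment(image, mask):
    # Sketch of the augmentation: image is a float tensor in [0, 1] of shape
    # (256, 256, 3); mask is the per-pixel label tensor of shape (256, 256, C).
    # Random zoom: crop the same random window from image and mask, resize back.
    scale = tf.random.uniform([], 0.8, 1.0)  # assumed zoom range
    crop = tf.cast(scale * IMG_SIZE, tf.int32)
    off_h = tf.random.uniform([], 0, IMG_SIZE - crop + 1, dtype=tf.int32)
    off_w = tf.random.uniform([], 0, IMG_SIZE - crop + 1, dtype=tf.int32)
    image = tf.image.crop_to_bounding_box(image, off_h, off_w, crop, crop)
    mask = tf.image.crop_to_bounding_box(mask, off_h, off_w, crop, crop)
    image = tf.image.resize(image, [IMG_SIZE, IMG_SIZE])
    mask = tf.image.resize(mask, [IMG_SIZE, IMG_SIZE], method="nearest")
    # Random color jitter on the image only: brightness, contrast, saturation, hue.
    image = tf.image.random_brightness(image, max_delta=0.1)
    image = tf.image.random_contrast(image, 0.9, 1.1)
    image = tf.image.random_saturation(image, 0.9, 1.1)
    image = tf.image.random_hue(image, max_delta=0.05)
    return tf.clip_by_value(image, 0.0, 1.0), mask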
Figure S1. (a) A sample FEES image, (b) its corresponding ground truth, and a qualitative comparison of (c) the customized UNet, (d) DeepLabv3+, (e) TransUNet, and (f) Swin-UNet. The proposed method achieves segmentation results close to the ground truth.

Let y_ci denote the one-hot encoding of the ground truth and p_ci the value predicted by the model for each class, where the indices c and i iterate over all classes and pixels, respectively. The generalized Dice loss (L_GD) [11], generalized Tversky loss (L_GT), and categorical cross-entropy loss (L_CCE) are computed from these quantities as follows.
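For reference, the standard forms of these losses in the notation above are reproduced below (normalization constants over pixels are omitted); the class weights w_c in L_GD and the parameters α and β in L_GT follow common practice and should be read as assumptions where this paper does not state its exact choices.

$$\mathcal{L}_{CCE} = -\sum_{c}\sum_{i} y_{ci}\,\log p_{ci},$$

$$\mathcal{L}_{GD} = 1 - 2\,\frac{\sum_{c} w_{c}\sum_{i} y_{ci}\,p_{ci}}{\sum_{c} w_{c}\sum_{i}\left(y_{ci}+p_{ci}\right)},\qquad w_{c}=\frac{1}{\left(\sum_{i} y_{ci}\right)^{2}},$$

$$\mathcal{L}_{GT} = \sum_{c}\left(1-\frac{\sum_{i} y_{ci}\,p_{ci}}{\sum_{i} y_{ci}\,p_{ci}+\alpha\sum_{i}\left(1-y_{ci}\right)p_{ci}+\beta\sum_{i} y_{ci}\left(1-p_{ci}\right)}\right).$$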
To choose a suitable loss function, we performed evaluations under four scenarios: L_GD, L_GT, a combination of L_GD and L_CCE (L_CCE+GD), and a combination of L_GT and L_CCE (L_CCE+GT). All networks minimized their loss with the adaptive moment estimation (ADAM) optimizer [12]. All networks were trained for 100 epochs with a batch size of 8, because none of the networks improved after 100 epochs. The initial learning rate depends on the network and decays by 1e-4 if the validation loss does not improve after four epochs.
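A minimal TensorFlow sketch of the combined L_CCE+GT objective and the training configuration described above is given below; the Tversky weights, the initial learning rate, and the arguments to the plateau-based learning-rate schedule are illustrative assumptions, not the exact settings used in this work.

import tensorflow as tf

def generalized_tversky_loss(y_true, y_pred, alpha=0.5, beta=0.5, eps=1e-7):
    # y_true, y_pred: (batch, H, W, C) one-hot ground truth and softmax output.
    # alpha and beta weight false positives and false negatives; the values
    # here are assumptions.
    axes = (0, 1, 2)  # sum over batch and spatial dimensions, keep the class axis
    tp = tf.reduce_sum(y_true * y_pred, axis=axes)
    fp = tf.reduce_sum((1.0 - y_true) * y_pred, axis=axes)
    fn = tf.reduce_sum(y_true * (1.0 - y_pred), axis=axes)
    tversky = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    return tf.reduce_sum(1.0 - tversky)

def cce_plus_gt(y_true, y_pred):
    # L_CCE+GT: categorical cross-entropy plus the generalized Tversky loss.
    cce = tf.reduce_mean(tf.keras.losses.categorical_crossentropy(y_true, y_pred))
    return cce + generalized_tversky_loss(y_true, y_pred)

# Training configuration described above: ADAM, batch size 8, 100 epochs, and a
# plateau-based learning-rate schedule (initial learning rate and ReduceLROnPlateau
# arguments are assumptions).
# model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss=cce_plus_gt)
# callbacks = [tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", patience=4)]
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, batch_size=8, callbacks=callbacks)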

Results
Quantitative segmentation performance is evaluated by the sensitivity (%), specificity (%), Dice similarity coefficient (DSC, %), and 95% Hausdorff distance (HD95) defined in the main paper.
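For reference, a simple NumPy sketch of the overlap-based metrics (sensitivity, specificity, and DSC) computed per class from label maps is shown below; HD95 additionally requires surface-distance computation and is omitted here. The function name and smoothing constant are illustrative.

import numpy as np

def per_class_metrics(pred, gt, num_classes, eps=1e-7):
    # pred and gt are integer label maps of identical shape (e.g., 256 x 256).
    metrics = {}
    for c in range(num_classes):
        p, g = (pred == c), (gt == c)
        tp = np.logical_and(p, g).sum()
        fp = np.logical_and(p, ~g).sum()
        fn = np.logical_and(~p, g).sum()
        tn = np.logical_and(~p, ~g).sum()
        metrics[c] = {
            "sensitivity": tp / (tp + fn + eps),
            "specificity": tn / (tn + fp + eps),
            "dsc": 2 * tp / (2 * tp + fp + fn + eps),
        }
    return metrics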

Results on Test FEES Videos
The segmentation performance of the customized UNet and its comparison with the other networks are shown in Table S1. We make four major observations from Table S1. First, the customized UNet is the best network for segmenting FEES videos in our experiments; trained with L_CCE+GT, it achieved performance superior to the other networks in terms of DSC and specificity. Second, the choice of loss function for a given network matters. In terms of DSC, HRNet and ResUNet-a achieved their best segmentation performance with L_CCE, Swin-UNet achieved its best segmentation performance with L_CCE+GD, and DeepLabv3+, VNet, UNet++, U2-Net, Att UNet, TransUNet, and the customized UNet achieved their best segmentation performance with L_CCE+GT. Third, balancing sensitivity and specificity is important. DeepLabv3+ trained with L_CCE+GT attained the highest sensitivity of 99.58% but low specificity, resulting in a lower DSC than the customized UNet. Fourth, modifying the original UNet for binary medical segmentation may reduce segmentation performance on multiclass segmentation tasks: the convolutional baselines VNet, UNet++, ResUNet-a, and U2-Net and the attention-based baselines Att UNet, TransUNet, and Swin-UNet showed lower segmentation performance, to varying degrees, than the customized UNet.

Fig. S1 shows an example FEES image S1(a), its corresponding ground truth S1(b), and the segmentation results of the customized UNet S1(c), DeepLabv3+ S1(d), TransUNet S1(e), and Swin-UNet S1(f). The image was extracted from a test video, and the customized UNet outperformed all baselines. This test video was from a patient with no penetration, aspiration, or residue in the hypopharynx. The customized UNet successfully segmented the residue in the patient's vallecula and also predicted no residue in the aspiration area or penetration area. The baseline networks, by contrast, predicted test bolus in the aspiration area (DeepLabv3+, TransUNet, and Swin-UNet) or the penetration area (TransUNet and Swin-UNet). As shown in Table S1, DeepLabv3+ achieved the second-best DSC, and in this example it also produced the second-best segmentation result: it only wrongly segmented a few pixels in the aspiration area as residue, which had little impact on the laryngologists' judgment. In contrast, TransUNet and Swin-UNet wrongly segmented many pixels in the aspiration and penetration areas. Segmentation of the test bolus increases the contrast of fluids in the video recorded during FEES and thus enhances the detection of aspiration and penetration. Whereas instantaneous green "flickering" inside the hypopharynx or vallecula can be ignored, instantaneous green "flickering" in the aspiration or penetration area can lead to a very different judgment of aspiration and penetration. For practical reasons, a network with high specificity should therefore be preferred.
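To make the link between segmentation and the downstream findings concrete, the following sketch illustrates one plausible way to turn per-frame segmentation masks into video-level findings while ignoring instantaneous flickering; the class indices, pixel threshold, and persistence requirement are assumptions for illustration and are not the decision rule actually used by FEES-CAD.

import numpy as np

# Hypothetical class indices for the segmented regions.
PENETRATION, ASPIRATION, VALLECULA, HYPOPHARYNX = 1, 2, 3, 4

def detect_findings(masks, min_pixels=50, min_consecutive_frames=3):
    # masks: (num_frames, H, W) integer label maps for one FEES video.
    findings = {}
    for name, label in [("penetration", PENETRATION), ("aspiration", ASPIRATION),
                        ("residue_vallecula", VALLECULA),
                        ("residue_hypopharynx", HYPOPHARYNX)]:
        per_frame = (masks == label).sum(axis=(1, 2)) >= min_pixels
        # Ignore instantaneous "flickering": require several consecutive
        # positive frames before reporting a finding.
        run, longest = 0, 0
        for hit in per_frame:
            run = run + 1 if hit else 0
            longest = max(longest, run)
        findings[name] = longest >= min_consecutive_frames
    return findings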

Discussion
The endoscope is an essential part of FEES. Illumination has always been a major challenge for every type of endoscope and is now more important than ever for a network, because images captured under proper illumination make learning easier (e.g., by avoiding low contrast or by using structured light to obtain depth information) [13]. However, it is inevitable that many images in a FEES image sequence are captured under poor illumination conditions. Fig. S2 shows an example of an improperly illuminated FEES image S2(a) that exhibits low brightness and low contrast. In this example with no residue in the aspiration area, penetration area, hypopharynx, or vallecula, the customized UNet S2(c) achieved a segmentation result identical to the ground truth S2(b). In contrast, DeepLabv3+ S2(d) and Swin-UNet S2(f) incorrectly classified some pixels in the hypopharynx as test bolus, and TransUNet S2(e) wrongly predicted some pixels in the aspiration and penetration areas as test bolus. Segmentation increases the contrast of the test bolus in the FEES image, so a false prediction of test bolus in a low-contrast image is particularly conspicuous and may lead to an improper judgment. Pixels of the none class dominate many frames in FEES videos. This again indicates that a network with a simple architecture and high specificity is desirable in practice.

Concurrent Works
HRNet and DeepLabv3+ can maintain high-resolution representations. VNet, UNet++, ResUNet-a, and U2-Net can learn rich multi-scale contextual information from a mixture of receptive fields of different sizes. The transformer-based UNet variants Att UNet, TransUNet, and Swin-UNet can capture global and long-range semantic interactions. We observed that these networks are not well suited to FEES video segmentation, which is a multiclass, class-imbalanced problem. Unlike these works, we confirmed the feasibility of simply applying an encoder-decoder with skip connections to FEES image segmentation without pre-training. Quantitative and qualitative results imply that the customized UNet is suitable for supporting diagnostic decision-making on FEES videos.
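As a rough illustration of such an encoder-decoder with skip connections, a plain UNet can be written in a few lines of Keras; the depth, filter counts, and number of output classes below are assumptions for illustration and do not necessarily match the customized UNet used in this work.

import tensorflow as tf
from tensorflow.keras import layers

def build_unet(input_shape=(256, 256, 3), num_classes=5, base_filters=32):
    # Plain UNet: convolutional encoder, bottleneck, and decoder whose upsampled
    # features are concatenated with the corresponding encoder features (skips).
    def conv_block(x, filters):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu",
                          kernel_initializer="he_normal")(x)
        return layers.Conv2D(filters, 3, padding="same", activation="relu",
                             kernel_initializer="he_normal")(x)

    inputs = tf.keras.Input(shape=input_shape)
    skips, x = [], inputs
    for i in range(4):                        # encoder
        x = conv_block(x, base_filters * 2 ** i)
        skips.append(x)
        x = layers.MaxPooling2D(2)(x)
    x = conv_block(x, base_filters * 16)      # bottleneck
    for i in reversed(range(4)):              # decoder with skip connections
        x = layers.Conv2DTranspose(base_filters * 2 ** i, 2, strides=2,
                                   padding="same")(x)
        x = layers.Concatenate()([x, skips[i]])
        x = conv_block(x, base_filters * 2 ** i)
    outputs = layers.Conv2D(num_classes, 1, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)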